14 research outputs found

    Cross-Lingual and Cross-Chronological Information Access to Multilingual Historical Documents

    In this chapter, we present our work in realizing information access across different languages and periods. Nowadays, digital collections of historical documents have to handle materials written in many different languages and time periods. Even within a single language, there are significant differences over time in grammar, vocabulary and script. Our goal is to develop a method for accessing digital collections spanning a wide range of periods, from ancient to modern. We introduce an information extraction method for digitized ancient Mongolian historical manuscripts that reduces labour-intensive manual analysis. The proposed method performs computerized analysis of Mongolian historical documents. Named entities such as personal names and place names are extracted by employing a support vector machine. The extracted named entities are used to create a digital edition that reflects an ancient Mongolian historical manuscript written in traditional Mongolian script. The Text Encoding Initiative (TEI) guidelines are adopted to encode the named entities, transcriptions and interpretations of ancient words. A web-based prototype system is developed for using these digital editions of ancient Mongolian historical manuscripts as scholarly tools. The prototype can display and search traditional Mongolian text and its transliteration in Latin letters, along with highlighted named entities and scanned images of the source manuscript.
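As a rough illustration of the named-entity extraction step described above, the following sketch trains a support vector machine on window features of transliterated tokens and tags them with BIO-style labels. It uses scikit-learn; the feature names, tokens, and labels are invented for illustration and are not the authors' actual pipeline.

```python
# Hypothetical sketch: SVM-based named-entity tagging of transliterated tokens.
# Features, example tokens, and labels are illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(tokens, i):
    """Simple window features around token i (illustrative)."""
    token = tokens[i]
    return {
        "token": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "suffix3": token[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Toy training data: tokens of a transliterated sentence with BIO-style tags.
sentences = [["Cinggis", "qagan", "went", "to", "Burqan", "Qaldun"]]
labels = [["B-PER", "O", "O", "O", "B-LOC", "I-LOC"]]

X = [token_features(s, i) for s in sentences for i in range(len(s))]
y = [tag for sent in labels for tag in sent]

model = make_pipeline(DictVectorizer(sparse=True), LinearSVC())
model.fit(X, y)

test = ["Temujin", "reached", "Burqan", "Qaldun"]
print(model.predict([token_features(test, i) for i in range(len(test))]))
```

In a full system, the predicted personal and place names would then be wrapped in TEI elements such as persName and placeName within the digital edition, alongside transcriptions and interpretations.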

    A Prototypical Network-Based Approach for Low-Resource Font Typeface Feature Extraction and Utilization

    This paper introduces a framework for retrieving low-resource font typeface databases by handwritten input. A new deep learning model based on metric learning is proposed to extract the features of a character typeface and predict the category of handwritten input queries. Rather than relying on abundant training data, we aim to utilize ancient character font typefaces with only one sample per category. Our research aims to achieve decent retrieval performance over more than 600 categories of handwritten characters automatically. We also consider utilizing generic handcrafted features to train a model that helps a voting classifier make the final prediction. The proposed method is implemented on the ‘Shirakawa font oracle bone script’ dataset as an isolated ancient-character-recognition system based on free stroke ordering and connected strokes. We evaluate the proposed model on several standard character and symbol datasets. The experimental results show that the proposed method performs well at extracting the features of symbol or character font images needed for further retrieval tasks. A demo system has been released; it requires only one sample per character to predict the user input. The extracted features are effective for finding the highest-ranked relevant item in retrieval tasks, can be utilized in various technical frameworks for ancient character recognition, and can be applied to educational application development.
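A minimal sketch of the metric-learning idea behind prototypical networks, assuming PyTorch: each of the (here, 600) classes is represented by the embedding of its single support glyph, and a handwritten query is ranked against these prototypes by cosine similarity. The encoder architecture, image sizes, and random tensors are placeholders, not the paper's actual model.

```python
# Hypothetical sketch of prototypical-network-style one-shot retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharEncoder(nn.Module):
    """Small CNN mapping a grayscale character image to a normalized embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

encoder = CharEncoder()

# One support image per class (one-shot): stand-ins for 600 font glyph images.
support = torch.randn(600, 1, 64, 64)
queries = torch.randn(8, 1, 64, 64)   # stand-ins for handwritten inputs

with torch.no_grad():
    prototypes = encoder(support)      # (600, dim): one prototype per class
    q = encoder(queries)               # (8, dim)
    scores = q @ prototypes.t()        # cosine similarity (embeddings are normalized)
    ranking = scores.argsort(dim=-1, descending=True)

print(ranking[:, :5])  # top-5 candidate classes for each query
```

The paper additionally combines such learned features with generic handcrafted features and a voting classifier for the final prediction, which this sketch omits.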

    Metadata-related Challenges for Realizing a Federated Searching System for Japanese Humanities Databases

    This paper provides a summary of our ongoing project for providing integrated access to multiple Japanese digital libraries, archives, and museums. The main goal, constructing a federated searching system for Japanese humanities databases that searches multiple databases in parallel and provides on-the-fly integration of the results, has required the system to deal with heterogeneous metadata schemas in various formats. In this paper we discuss the metadata-related challenges faced at the front end when retrieving from multiple Japanese databases in parallel and integrating bilingual retrieved results. Aggregation and integration of retrieved results in English and Japanese are complicated when a search must span multilingual sources.
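To make the front-end challenge concrete, here is a hedged sketch of parallel fan-out searching with per-source metadata mapping onto a common schema. The source names, field mappings, sample records, and the fetch stub are hypothetical; a real system would call each database's search API and also handle Japanese/English normalization of the returned values.

```python
# Hypothetical sketch of a federated search front end: query several databases
# in parallel and map their heterogeneous metadata fields onto a shared schema.
from concurrent.futures import ThreadPoolExecutor

# Per-source mapping from native metadata fields to a shared schema (illustrative).
FIELD_MAPS = {
    "db_a": {"title": "title", "creator": "author", "date": "year"},
    "db_b": {"dc:title": "title", "dc:creator": "author", "dc:date": "year"},
}

def fetch(source, query):
    """Stand-in for an HTTP call to one database's search API."""
    sample = {
        "db_a": [{"title": "Genji monogatari", "creator": "Murasaki Shikibu", "date": "1008"}],
        "db_b": [{"dc:title": "源氏物語", "dc:creator": "紫式部", "dc:date": "1008"}],
    }
    return sample[source]

def normalize(source, record):
    """Rename one source's native fields to the shared schema."""
    mapping = FIELD_MAPS[source]
    return {common: record[native] for native, common in mapping.items() if native in record}

def federated_search(query):
    with ThreadPoolExecutor() as pool:
        futures = {src: pool.submit(fetch, src, query) for src in FIELD_MAPS}
        results = []
        for src, fut in futures.items():
            results += [normalize(src, rec) | {"source": src} for rec in fut.result()]
    return results

print(federated_search("Genji"))
```

Even in this toy form, the bilingual records returned for the same work hint at why on-the-fly aggregation across English and Japanese sources is the hard part of the integration.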

    Intuitively Searching for the Rare Colors from Digital Artwork Collections by Text Description: A Case Demonstration of Japanese Ukiyo-e Print Retrieval

    In recent years, artworks have been increasingly digitized and built into databases, and such databases have become convenient tools for researchers. Researchers who retrieve artworks include not only humanities scholars but also researchers in materials science, physics, art, and other fields. It may be difficult for researchers in various fields whose studies focus on the colors of artworks to find the color-related records they need in existing databases, which can only be queried by metadata. Moreover, although some image retrieval engines can be used to retrieve artworks by text description, existing systems mainly match the dominant colors of images, so rare cases of color use are difficult to find. This makes it difficult for many researchers who focus on toning, colors, or pigments to use search engines for their own needs. To solve these two problems, we propose a cross-modal multi-task fine-tuning method based on CLIP (Contrastive Language-Image Pre-Training), which uses the human sensory characteristics of colors contained in the language space and the geometric characteristics of the sketches of a given artwork in order to obtain better representations of that artwork. The experimental results show that the proposed retrieval framework is efficient for intuitively searching for rare colors, and that a small amount of data can improve the correspondence between text descriptions and color information.
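A hedged sketch of the underlying text-to-image retrieval step, using an off-the-shelf CLIP model from Hugging Face Transformers: the text query and the artwork images are embedded in the shared space and ranked by cosine similarity. The file names and query are illustrative, and the paper's multi-task fine-tuning on color and sketch characteristics is not reproduced here.

```python
# Hypothetical sketch: text-to-image retrieval with a pretrained CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative file names; any collection of digitized print images would do.
image_paths = ["print_001.jpg", "print_002.jpg", "print_003.jpg"]
images = [Image.open(p).convert("RGB") for p in image_paths]
query = "an ukiyo-e print with a rare use of deep purple in the background"

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**img_inputs)
    text_emb = model.get_text_features(**txt_inputs)

# Cosine similarity between the query and each image, highest first.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.t()).squeeze(0)
for i in scores.argsort(descending=True):
    print(image_paths[i], float(scores[i]))
```

An off-the-shelf CLIP model like this tends to favor an image's dominant colors; the fine-tuning described in the paper is what targets the rare-color cases.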